Overview

The topic of this project focuses on bulk RNA-sequencing of lung alveolar type II cells from hyperoxia-injuryed mice. The goal is to see what is the difference in gene expression between alveolar type II cells derived from alveolar type I cells and the rest of the alveolar type II cells. The unpublished sequencing data used in this project comes from Dr. David B. Frank's lab which I currently work in. I first talked with my PI Dr. David B Frank who is a pediatrian at Children's Hospital of Philadelphia focusing on pulmonary diseases to have a better understanding of the motivation behind this preliminary bulk RNA-sequencing and how the results of this experiment can be applied in the medical field. Then I talked with Dr. Prashant Chandrasekaran who is a PhD in Dr. Frank lab with a lot of wet lab experience with mice to learn the details of the experimental procedures. Finally, I talked with Dr. Mengyuan Kan, a postdoc in the Department of Biostatistics, Epidemiology and Informatics, at the University of Pennsylvania, working with Dr. Blanca E. Hime, who guides me to analyze the bulk RNA-sequencing data and obtain differential gene expression results. The link to my final project GitHun repository is below:

https://github.com/wenhongbo18/BMIN503_Final_Project

Introduction

Bronchopulmonary dysplasia (BPD) is a chronic lung disease common in premature infants caused by aggressive mechanical ventilator approach on a relatively mature lung lacking surfactant (Davidson et al., 2017; Thebaud et al., 2019). The additional pressure from the ventilation will damage the alveoli, where the gas exchange takes place. Short-term symptoms of BPD include rapid, labored breathing, difficult feeding, repeated lung infections and can eventually lead to asthma-like symptoms, exercise intolerance, and pulmonary arterial hypertension. Almost 50,000 infants were born with less than 28 weeks’ gestation each year in the USA, and approximately 35% of these children develop BPD (Thebaud et al., 2019). Due to the lack of understanding on the mechanisms of BPD, choices of prevention and treatment remain limited and inefficient. Lung alveoli are composed of two types of cells: elongated alveolar type I (AT1) cells which are the site of gas exchange and cuboidal alveolar type II (AT2) cells whose major role is to produce surfactant which lowers the surface tension of alveoli and makes sure they do not collpase when expanding. Previous studies have demonstrated the plasticity of AT2 progenitor cells and how they can differentiate and proliferate to heal the injuried alveoli in adult mice (Barkauskas et al., 2013). Compared to AT2 cells, AT1 cells were previously believed to lack plasticity and cannot change their cell fates. However, a recent study shows that upon acute neonatal lung injury, AT1 cells reprogram into AT2 cells, promoting alveolar regeneration (Penkala et al., 2021). Considering these background information, our research group asks the question what signaling pathways govern the plasticity of AT1 cells. The answer to this question will have great application in treating BPD. Researchers can find ways to upregulate or downregulate these signaling pathways to increase plasticity of AT1 cells, thereby promoting the healing of alveolar damage caused by BPD.

An interdisciplinary approach is needed to solve this problem. A strong biology and chemistry background is needed to design the experiment, choose the right mice model, and obtain the specific types of cells. Afterwards, biomedical informatics knowledge comes in to help organize the data, perform statistical analysis, and locate the genes that have the highest difference in expression between two population of the cells. Then, a good interpretation of the data requires a comprehensive understanding of genetics. After finding relevant genes, the search of existing medications that target related signaling pathway and the determination of the effectiveness of such medications need perspectives from the clinical side.

Methods

Mouse deisgn

The mouse model used in this experiment was the HopXCre R26Rtdtomato SPC-GFP mice. Due to the gentic design, all the AT2 cells were marked with a green color by the green fluroscent protein (GFP), but only the AT1-derived AT2 cells also expressed a red color (tdtomato) from the Cre-LoxP lineage tracing design. Because ths is only a preliminary experiment conducted in my lab, only two mice were used named 927_1 and 980_3.

Experimental Procedures Before Sequencing

Since we were interested in exploring the AT1 and AT2 cells fate after injury, the two mice were put in the hyperoxia chamber at 90% of oxygen level for four days to mimic the conditions of BPD. The mice were then transported to live in normal condition until postnatal day 42. The lungs were then harvested, single cell suspension was achieved, and samples were submitted to flow cytometry after appropriate antibody staining.Two cell populations were harvested through flow cytometry: AT1-derived AT2 cells (GFP-tdtomato) and other AT2 cells (GFP). Bulk RNA-sequencing is performed on these two sets of cells to see what difference in gene expression explains this phenotype. The two different groups of cells underwent cell lysis, mRNA extraction, mRNA fragmentation, reverse transcription to cDNA, PCR amplification, ligation of sequence adaptors, and size selection. Afterwards, the samples were put into the flow cells of Illumina sequencer to obtain the sequence of all the cDNAs, and the data was stored in a hard drive.

Server Set up

All the data was processed on a remote server at Dr. Himes lab, so the first step is to transfer all the data into the remote server. A PMACS account was set up for me to use. Linux command is used below.

  1. Login to the server, use the command:

ssh hwen@bhimes.pmacs.upenn.edu

  1. Under the RNASeq folder, create a folder called Lung_AT2:

mkdir -p /projects/RNASeq/Lung_AT2

  1. Transfer all the sequencing data into this folder:

scp -r (full path of the data folder on my computer) hwen@bhimes.pmacs.upenn.edu:/projects/RNASeq/Lung_AT2 df -h

  1. Initialize the python 2.7:

conda activate py2

Quality Control

All the codes were modified based on a pipeline designed by Dr. Himes lab. Below is the link to the github repository for the codes:

https://github.com/HimesGroup/taffeta

Linux Command

  1. Create a folder named files and inside that folder create a txt file called Lung_AT2_Phenotype_withoutQC.txt.

The Lung_AT2_Phenotype_withoutQC.txt file has 8 columes:

Sample: Sample name Donor: Names of the mice that the samples were from Tissue: Type of the tissue the cells came from. Here it is alveolar type 2 cells from lung tissues. Treatment: Hyperoxia injury Disease: All mice were healthy beforehand Status: This used for comparison. We have two groups: AT2 cells derived from AT1 cells (AT2_AT1) and other AT2 cells (AT2) R1: directory of the R1 read (5' direction of the top strand) R2: directory of the R2 read (5' direction of the bottom strand)

mkdir -p /projects/RNASeq/Lung_AT2/files

cd /projects/RNASeq/Lung_AT2/files

cat > Lung_AT2_Phenotype_withoutQC.txt

  1. Run pipeline scripts/rnaseq align and qc.py to: 1) trim adapter and primer sequences if index information is available, 2) run FastQC for (un)trimmed .fastq files, 3) align reads and quantify reads mapped to genes/transcripts, and 4) obtain various QC metrics from .bam files. At the end of these steps, the readcount file was generated:

mkdir -p /projects/RNASeq/Lung_AT2/scripts/align

cd /projects/RNASeq/Lung_AT2/scripts/align

rnaseq_align_and_qc.py --project_name Lung_AT2 --samples_in /projects/RNASeq/Lung_AT2/files/Lung_AT2_Phenotype_withoutQC.txt --aligner star --ref_genome mm10 --library_type PE --strand nonstrand --path_start /projects/RNASeq/Lung_AT2 --template_dir /projects/pipelines/RNASeq

mk_lsf2bashsub.py --path /projects/RNASeq/Lung_AT2/scripts/align --lsf_dir /projects/RNASeq/Lung_AT2/scripts/align --job_name align --nthread 1 & echo $! > /projects/RNASeq/Lung_AT2/scripts/align/align.jobid

jobid=cat /projects/RNASeq/Lung_AT2/scripts/align/align.jobid ps -A | grep $jobid

  1. create an HTML report of QC and alignment summary statistics for RNA-seq samples:

mkdir -p /projects/RNASeq/Lung_AT2/scripts/qc_report

cd /projects/RNASeq/Lung_AT2/scripts/qc_report

rnaseq_align_and_qc_report.py --project_name Lung_AT2 --samples_in /projects/RNASeq/Lung_AT2/files/Lung_AT2_Phenotype_withoutQC.txt --aligner star --ref_genome mm10 --library_type PE --strand nonstrand --path_start /projects/RNASeq/Lung_AT2 --template_dir /projects/pipelines/RNASeq

bash /projects/RNASeq/Lung_AT2/scripts/qc_report/Lung_AT2_qc.lsf > Lung_AT2_qc.log 2>&1 & echo $! > /projects/RNASeq/Lung_AT2/scripts/qc_report/qc_report.jobid

jobid=cat /projects/RNASeq/Lung_AT2/scripts/qc_report/qc_report.jobid ps -A | grep $jobid

R Command

Aligner: STAR (2.7.6a)

Genome: For mouse, the UCSC mm10 assembly available in iGenomes was used.

Informatics tools used:

  • Trimmomatic (0.39)
  • FastQC (v0.11.9)
  • STAR (2.7.6a)
  • samtools (1.10)
  • bamtools (2.5.1)
  • Picard Tools (2.22.0)

Sequencing parameters:

  • library_type = PE
  • strand = nonstrand
  • ref_genome = mm10
project_name="Lung_AT2"
path.start="/projects/RNASeq/Lung_AT2/Lung_AT2_Alignment_QC_Report_star/"
sample.names.orig <- c('DF_927_1_GFP_tdt', 'DF_927_1_GFP', 'DF_980_3_GFP_tdt', 'DF_980_3_GFP')
sample.names <- sample.names.orig
genome="mm10"
library_type="PE"
aligner="star"
sample_info_file='/projects/RNASeq/Lung_AT2/files/Lung_AT2_Phenotype_withoutQC.txt'
count_data_file='/projects/RNASeq/Lung_AT2/Lung_AT2_Alignment_QC_Report_star/Lung_AT2_htseq_gene.txt'

For each sample, the following programs were run to generate the data necessary to create this report. Written as for unstranded paired-end data. For single-end reads, R2s and insert size metrics would be omitted.

java -Xmx1024m TrimmomaticPE -phred33 [raw_sample_R1] [raw_sample_R2] [sample_R1] [sample_R1_unpaired] [sample_R2] [sample_R2_unpaired] HEADCROP:[bases to trim, if any] ILLUMINACLIP:[sample_primer_fasta]:2:30:10 MINLEN:50

fastqc [sample_R1] [sample_R2]

cat [sample_R1/R2] | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(count[read]==1){unique++}};print total,unique,unique*100/total}'

The following STAR options were used:

STAR --genomeDir [ref_genome_index] --runThreadN 12 --outReadsUnmapped Fastx --outMultimapperOrder Random --outSAMmultNmax 1 --outFilterIntronMotifs RemoveNoncanonical --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --readFilesIn [sample_R1] [sample_R2]

Using aligned output files accepted_hits.bam and unmapped.bam:

samtools sort accepted_hits.bam accepted_hits.sorted

samtools index accepted_hits.sorted.bam

samtools idxstats accepted_hits.sorted.bam > accepted_hits.sorted.stats

bamtools stats -in accepted_hits.sorted.bam > accepted_hits.sorted.bamstats

bamtools filter -in accepted_hits.sorted.bam -script cigarN.script | bamtools count

samtools view -c unmapped.bam

java -Xmx2g -jar CollectRnaSeqMetrics.jar REF_FLAT=[ref_flat file] STRAND_SPECIFICITY=NONE INPUT=accepted_hits.bam OUTPUT=RNASeqMetrics

java -Xmx2g -jar CollectInsertSizeMetrics.jar HISTOGRAM_FILE=InsertSizeHist.pdf INPUT=accepted_hits.sorted.bam OUTPUT=InsertSizeMetrics (for paired-end library)

setwd("/Users/tony/Desktop/BMIN503_Final_Project/Lung_AT2_Alignment_QC_Report_star")
metrics.data <- read.table(paste(project_name,"_rnaseqmetrics_hist.txt", sep=""), header=T)
counts.data <- read.table(paste(project_name,"_counts.txt", sep=""), header=T, sep="\t")
#ercc.data <- read.table(paste(project_name,"_ERCC.txt", sep=""), header=T, as.is=T)
summary.data <- read.table(paste(project_name,"_rnaseqmetrics_summary.txt", sep=""), header=T, as.is=T, sep="\t")
bamstats.data <- read.table(paste(project_name,"_bamstats_counts.txt", sep=""), header=T, as.is=T, sep="\t")
if (library_type %in% c("PE")) {
  insert.summary.data <- read.table(paste(project_name,"_insertmetrics_summary.txt", sep=""), header=T, as.is=T, sep="\t")
  insert.metrics.data <- data.frame(c(0:1))
  names(insert.metrics.data) <- "Insert_Size"
  for (i in c(1:length(sample.names))){
    curr.hist.data <- read.table(paste(project_name,"_",sample.names[i],"_insertmetrics_hist.txt", sep=""), header=T, as.is=T, sep="\t")
    insert.metrics.data <- merge(insert.metrics.data, curr.hist.data, all=TRUE)
  }
  write.table(insert.metrics.data,paste(project_name,"_insertmetrics_hist.txt", sep=""), col.names=T, row.names=F, sep='\t', quote=F)
}
unique.counts.data <- read.table(paste(project_name,"_unique_counts.txt", sep=""), header=T, sep="\t")
duplicates <- read.table(paste(project_name,"_duplicates.txt", sep=""), header=T, sep="\t", as.is=T)

Differential Gene Expression

All the codes were modified based on a pipeline designed by Dr. Himes lab. Below is the link to the github repository for the codes:

https://github.com/HimesGroup/taffeta

Linux Command

  1. Inside the file folder create two txt files called Lung_AT2_Phenotype_withQC.txt, and Lung_AT2_comp_file.txt.

The Lung_AT2_Phenotype_withoutQC.txt file has 9 columes:

Sample: Sample name Donor: Names of the mice that the samples were from Tissue: Type of the tissue the cells came from. Here it is alveolar type 2 cells from lung tissues. Treatment: Hyperoxia injury Disease: All mice were healthy beforehand Status: This used for comparison. We have two groups: AT2 cells derived from AT1 cells (AT2_AT1) and other AT2 cells (AT2) R1: directory of the R1 read (5' direction of the top strand) R2: directory of the R2 read (5' direction of the bottom strand) QC_Pass: 1 if the sample QC is good and can be included in the following DE analysis; 0 if the sample QC is bad and should be excluded in the following DE analysis. (Note: this is done after the QC report is generated. All the samples have good QCs and can be used in the following DE analysis)

The Lung_AT2_comp_file.txt file has 3 columes:

Condition1: AT2 cells derived from AT1 cells (AT2_AT1) Condition0: other AT2 cells (AT2) Design: paired:Donor. This specifies whether this sample is paired with another sample for comparison. All of the 4 samples used were paired.

mkdir -p /projects/RNASeq/Lung_AT2/files

cd /projects/RNASeq/Lung_AT2/files

cat > Lung_AT2_Phenotype_withQC.txt

cat > Lung_AT2_comp_file.txt

  1. Perform DE analysis and create an HTML report of differential expression summary statistics. At the end of these steps, the readcount file was generated:

mkdir -p /projects/RNASeq/Lung_AT2/scripts/deseq2

cd /projects/RNASeq/Lung_AT2/scripts/deseq2

rnaseq_de_report.py --project_name Lung_AT2 --samples_in /projects/RNASeq/Lung_AT2/files/Lung_AT2_Phenotype_withQC.txt --comp /projects/RNASeq/Lung_AT2/files/Lung_AT2_comp_file.txt --de_package deseq2 --ref_genome mm10 --path_start /projects/RNASeq/Lung_AT2 --template_dir /projects/pipelines/RNASeq

bash /projects/RNASeq/Lung_AT2/scripts/deseq2/Lung_AT2_deseq2.lsf > Lung_AT2_deseq2.log 2>&1 & echo $! > /projects/RNASeq/Lung_AT2/scripts/deseq2/Lung_AT2_deseq2.jobid

jobid=cat /projects/RNASeq/Lung_AT2/scripts/deseq2/Lung_AT2_deseq2.jobid ps -A | grep $jobid

R Command

Reads were aligned to the mm10 assembly using STAR (2.7.6a). The following alignment QC report was produced:

Lung_AT2_QC_RnaSeqReport.html

HTSeq (0.11.3) function htseq-count was used to count reads. Counts for all samples were concatenated into the following text file:

Lung_AT2_htseq_gene.txt

DESeq2 (1.28.1) was used for differential gene expression analaysis, based on the HTSeq counts matrix and the phenotype file provided. Normalized counts from DESeq2 are saved in the following text file:

Lung_AT2_counts_normalized_by_DESeq2.txt

Normalized counts are obtained from DESeq2 function estimateSizeFactors(), which divides counts by the geometric mean across samples; this function does not correct for read length. The normalization method is described in detail here: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106

Differential gene expression analysis was done for all comparisons provided in the comparisons file. The following design was used:

design = ~ + Donor + Status

If desired, the design can be modified to include more independent variables. In addition to the partial results displayed in this report, the full set of DESeq2 results for each comparison was saved down in separate text files, with names of the form:

Lung_AT2_CASE_vs_CONTROL_DESeq2_results.txt

where CASE and CONTROL are pairs of conditions specified in the comparisons file.

project_name="Lung_AT2"
path.start='/Users/tony/Desktop/BMIN503_Final_Project/'
housekeeping_genes <- c('Actb','Gapdh','B2m','Rpl19','Gabarap')
house_list=list() # create an empty list to save results of house-keeping genes
ref_genome <- 'mm10'
convert_dataset <- 'mmusculus_gene_ensembl'
fav_genes <- c('ACE2', 'TMPRSS2')
fav_list <- list()
coldata <- read.table('/Users/tony/Desktop/BMIN503_Final_Project/files/Lung_AT2_Phenotype_withQC.txt', sep='\t', header=TRUE)
coldata <- subset(coldata, QC_Pass==1)
countdata <- read.table(paste0(path.start, project_name,'_Alignment_QC_Report_star/', project_name, '_htseq_gene.txt'), sep='\t', header=TRUE, check.names=FALSE)
names(countdata) <- gsub('count_','',names(countdata))
countdata <- countdata[,c('Gene',as.character(coldata$Sample))]

row.names(countdata) <- countdata$Gene
is_ensg <- if (substr(rownames(countdata)[1], 1, 4) == 'ENSG') {TRUE} else {FALSE}
countdata$Gene <- sapply(strsplit(as.character(countdata$Gene), '\\.'), '[[', 1) # remove .5 in transcript ENSG00000000005.5
mart <- useMart(biomart='ENSEMBL_MART_ENSEMBL', host='jan2019.archive.ensembl.org', path='/biomart/martservice' ,dataset='mmusculus_gene_ensembl')
genes <- biomaRt::getBM(attribute=c('ensembl_gene_id', 'mgi_symbol'), values=countdata$Gene, mart=mart)
genes <- genes[!duplicated(genes$ensembl_gene_id),]
if (is_ensg) {countdata <- merge(countdata, genes, by.x='Gene', by.y='ensembl_gene_id')} else {countdata <- countdata}
if (is_ensg) {countdata <- rename(countdata, c('mgi_symbol'='gene_symbol'))} else {countdata$gene_symbol <- countdata$Gene}
if (is_ensg) {row.names(countdata) <- countdata$Gene} else {row.names(countdata) <- row.names(countdata)}
if (is_ensg) {countdata$gene_symbol[which(countdata$gene_symbol=='')] <- NA} # assign NA to genes without gene_symbol

Convert KEGG and REACTOME pathway files in .gmt format provided by MSigDB into a pathway list

pathways.msigkegg <- gmtPathways("/Users/tony/Downloads/c2.cp.kegg.v7.4.symbols.gmt")
pathways.msigreactome <- gmtPathways("/Users/tony/Downloads/c2.cp.reactome.v7.4.symbols.gmt")
pathways.msigkeggreac <- c(pathways.msigkegg,pathways.msigreactome)

Results

Quality Control

Summary Read Numbers

The number of raw reads correspond to those that passed Casava QC filters, were trimmed to remove adaptors by Trimmomatic, and were aligned by STAR to ref_genome+ERCC transcripts as reported in .info files. Unique read counts were obtained by using awk on trimmed fastq files. FastQC estimates of percentage of sequences remaining after deduplication were retrieved from fastqc_data.txt files. Bamtools statistics were based on sorted and indexed bam files. The mapped reads were those that mapped to reference and were output by STAR to accepted_hits.bam. The unmapped reads were output by STAR to unmapped.bam. Some reads may be mapped to multiple locations in the genome so that the number of total reads reported by bamstats may be greater than the number of raw reads. The Junction spanning reads are computed based on accepted_hits.bam CIGAR entries containing "N." Related text files that were saved:

Lung_AT2 _read_counts.txt

Lung_AT2 _duplicates.txt

Lung_AT2 _unique_counts.txt

Lung_AT2 _bamstats_counts.txt

Total Number of Raw Reads Summary

Read counts are shown by per million reads.

bamstats.summary <- bamstats.data
row.names(bamstats.summary)=bamstats.summary$Type
bamstats.summary$Type=NULL
if (library_type %in% c("PE")) { # total read counts from fastq
  total_reads=unique.counts.data$R1_Raw_Reads+unique.counts.data$R2_Raw_Reads
  bamstats.summary <- bamstats.summary[!row.names(bamstats.summary)%in%c("Failed QC","Duplicates"), , drop=FALSE]
} else {
  # total read counts from fastq
  total_reads=unique.counts.data$Raw_Reads
  bamstats.summary <- bamstats.summary[!row.names(bamstats.summary)%in%c("Failed QC","Duplicates","Paired-end reads"), , drop=FALSE]
}
bamstats.summary["Total reads",]=total_reads
unmapped_reads <- bamstats.summary["Total reads",] - bamstats.summary["Mapped reads",]
row.names(unmapped_reads) <- "Unmapped reads"
bamstats.summary <- rbind(bamstats.summary, unmapped_reads)
DT::datatable(bamstats.summary, options = list(pageLength = 25))

Plot: Sequence Duplication Level

dup.data <- duplicates
dup.data <- dup.data[which(dup.data$Read_Number!="Total Deduplicated Percentage"),]
dup.data$Read_Number <- 1:(nrow(duplicates)-1)
dup.data <- dup.data %>% tidyr::gather(Sample,value,-Read_Number)
# dup.data$Sample2 <- sapply(as.character(dup.data$Sample), function(x){strsplit(x, "_R1|_R2")[[1]]}) # this command does not work for samples with _R1 and _R2 already in the sample name
# create a temperaray sample data frame
sampleframe_R1=data.frame(Sample2=sample.names, Sample=paste0(sample.names,"_R1"))
sampleframe_R2=data.frame(Sample2=sample.names, Sample=paste0(sample.names,"_R2"))
sampleframe = rbind(sampleframe_R1, sampleframe_R2)
# create Sample2 column without read "_R1" or "_R2" suffix
dup.data <- merge(sampleframe, dup.data, by="Sample", all.y=T)

nsamp=ncol(duplicates)-1
# dup.org <- 1:(nrow(duplicates)-1)
# shift=unlist(lapply(1:nsamp, function(i){dup.org+delta*(i-1)}))
# dup.data$Read_Number <- shift
c <- rep(brewer.pal(12,"Paired"), nsamp)
# plot
ggplot(dup.data, aes(x=Read_Number,y=value,group=Sample2,color=Sample2)) +
    geom_line() +
    ggtitle(project_name) +
    xlab("Sequence Duplication Level") +
    ylab("Percentage of Total Sequences") +
    scale_x_continuous(breaks=seq(0,nrow(duplicates)-1,by=2)) +
    ylim(0, 100) +
    #scale_y_continuous(breaks=seq(0,100,by=20)) +
    scale_color_manual(values = c) +
    theme_bw() +
    theme(
        plot.title = element_text(size=18, hjust=0.5, face="bold"),
        legend.title = element_blank(),
        legend.text = element_text(size = 16),
        axis.text.y = element_text(size=16),
        axis.text.x  = element_text(size=16),
        axis.title = element_text(size=18))

Percentage of Mapped/Unmapped Reads

DT::datatable(bamstats.percent.table[c("Mapped reads","Unmapped"),], options = list(pageLength = 25))

Plot: Percentage of Mapped/Unmapped Reads

mapped.percent.for.plot <- rbind(
  data.frame(
    variable=colnames(bamstats.percent.table),
    value=as.numeric(bamstats.percent.table["Mapped reads",]),
    Type=rep("Mapped",ncol(bamstats.percent.table))),
  data.frame(
    variable=colnames(bamstats.percent.table),
    value=as.numeric(bamstats.percent.table["Unmapped reads",]),
    Type=rep("Unmapped",ncol(bamstats.percent.table)))
)
mapped.percent.for.plot$Type <- factor(mapped.percent.for.plot$Type, levels=c("Unmapped", "Mapped")) # order so mapped reads are at the bottom 

ggplot(data = mapped.percent.for.plot, aes(x = variable, y = value, fill=Type)) + 
    geom_bar(stat="identity") +
    scale_fill_manual(values=c("navy", "firebrick")) +
    labs(title=project_name, x="Sample", y="Percentage of Total Reads") +
    ylim(0, 100) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, size=14),
        legend.title = element_blank(),
        legend.text = element_text(size = 16),
            axis.text.y = element_text(size=16),
            plot.title = element_text(size=18, hjust = 0.5, face="bold"),
            axis.title.x = element_text(size=18),
            axis.title.y = element_text(size=18))

Plot: Percentage of Junction Spanning Reads Among Mapped Reads

junc.for.table <- data.frame(
    Sample=colnames(bamstats.percent.table),
    value=as.numeric(bamstats.percent.table["Junction Spanning Reads",])
)

ggplot(data = junc.for.table, aes(x = Sample, y = value)) + 
    geom_bar(stat="identity", fill="firebrick") +
    labs(title=project_name, x="Sample", y="Percentage of Junction Spanning Reads Among Mapped Reads") +
    ylim(0, ceiling(max(junc.for.table$value, na.rm=T))) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, size=14),
        legend.title = element_blank(),
        legend.text = element_text(size = 16),
            axis.text.y = element_text(size=16),
            plot.title = element_text(size=18, hjust = 0.5, face="bold"),
            axis.title.x = element_text(size=18),
            axis.title.y = element_text(size=18))

RnaSeqMetrics Summary

The Picard Tools RnaSeqMetrics function computes the number of bases assigned to various classes of RNA. It also computes the coverage of bases across all transcripts (normalized to a same-sized reference). Computations are based on comparison to a refFlat file. Related text files that were saved:

Lung_AT2 _rnaseqmetrics_summary.txt

Lung_AT2 _rnaseqmetrics_hist.txt

The Picard Tools RnaSeqMetrics function computes the number of bases assigned to various classes of RNA. It also computes the coverage of bases across all transcripts (normalized to a same-sized reference). Computations are based on comparison to a refFlat file. Related text files that were saved:

Lung_AT2 _rnaseqmetrics_summary.txt

Lung_AT2 _rnaseqmetrics_hist.txt

Plot: Percentages of Total Mapped Bases Mapping to mRNA, Intronic and Intergenic Regions

sum.data.for.plot <- sum.data[c("PCT_INTERGENIC_BASES", "PCT_INTRONIC_BASES", "PCT_MRNA_BASES"),]
rownames(sum.data.for.plot) <- c("Intergenic", "Intronic", "mRNA")
sum.data.for.plot$which  <- rownames(sum.data.for.plot)
sum.data.melted <- sum.data.for.plot %>% tidyr::gather(Sample,value,-which)
sum.data.melted$value <- as.numeric(sum.data.melted$value)
sum.data.melted$which <- factor(sum.data.melted$which, levels=c("mRNA", "Intronic", "Intergenic")) # desired order

ggplot(data = sum.data.melted, aes(x = Sample, y = value, fill=which)) + 
    geom_bar(stat="identity", position = "dodge") +  #note stacked is default for ggplot2, so must specify "dodge" to get side-by-side bars
    scale_fill_manual(values=c("firebrick", "darkblue", "darkgreen")) +
    labs(title=project_name, x="Sample", y="Percentage of Total Mapped Bases") +
    ylim(0, 100) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, size=16),
        legend.title = element_blank(),
        legend.text = element_text(size = 16),
            axis.text.y = element_text(size=16),
            plot.title = element_text(size=20, hjust = 0.5, face="bold"),
            axis.title.x = element_text(size=18),
            axis.title.y = element_text(size=18))

Plot: Normalized Coverage

pos.data <- metrics.data %>% tidyr::gather(Sample,coverage,-Normalized_Position)
nsamp=ncol(metrics.data)-1
# add slight shift to normalized position
delta <- 1/(4*length(nsamp-1))
pos.org <- metrics.data$Normalized_Position
shift=unlist(lapply(1:nsamp, function(i){pos.org+delta*(i-1)}))
pos.data$Normalized_Position <- shift
# plot parameters
x.max <- max(metrics.data$Normalized_Position)
y.max <- max(metrics.data[ ,c(2:ncol(metrics.data)), drop=FALSE])
c <- rep(brewer.pal(12,"Paired"), nsamp) 
ggplot(pos.data, aes(x=Normalized_Position,y=coverage,group=Sample,color=Sample)) + 
    geom_line() +
    theme_bw() +
    ggtitle(project_name) +
    xlab("Normalized Position") +
    ylab("Normalized Coverage") +
    scale_x_continuous(breaks=seq(0,x.max,by=20)) +
    scale_y_continuous(breaks=seq(0,y.max,by=0.2)) +
    scale_color_manual(values = c) +
    theme(
        legend.title = element_blank(),
    legend.text = element_text(size = 16),
    plot.title = element_text(size=20, hjust = 0.5, face="bold"),
        axis.text = element_text(size=16),
        axis.title = element_text(size=18)
    )

InsertSizeMetrics Summary

For paired-end data, the Picard Tools CollectInsertSizeMetrics function was used to compute the distribution of insert sizes in the accepted_hits.bam file and create a histogram. Related text files that were saved:

Lung_AT2 _insertmetrics_summary.txt

# Insert Size Summary
row.names(insert.summary.data) <- insert.summary.data$Type
insert.summary.data$Type <- NULL
metrics_row <- c("MEDIAN_INSERT_SIZE", "MEDIAN_ABSOLUTE_DEVIATION", "MIN_INSERT_SIZE", "MAX_INSERT_SIZE", "MEAN_INSERT_SIZE", "STANDARD_DEVIATION", "READ_PAIRS")
insert.summary.data <- insert.summary.data[metrics_row, ]
DT::datatable(insert.summary.data, options = list(pageLength = 25))

Plot: Median of Insert Size

insert.size <- data.frame(
    value=as.numeric(as.character(unlist(insert.summary.data["MEDIAN_INSERT_SIZE",]))),
    Sample=colnames(insert.summary.data))
ggplot(insert.size, aes(x=Sample, y=value)) + geom_bar(stat="identity",fill="firebrick") +
    ggtitle(project_name) +
    xlab("Sample") +
    ylab("Median Insert Size") +
    theme_bw() +
    theme(
        plot.title = element_text(size=20, hjust = 0.5, face="bold"),
    legend.title = element_blank(),
    legend.text = element_text(size = 16),
        axis.text.y = element_text(size=16),
    axis.text.x  = element_text(angle=90, hjust=1, size=16),
        axis.title = element_text(size=18))

Differential Expression Analysis: AT2_AT1 vs. AT2

Samples in this comparison

DE analysis

Comparison Summary
Status Count
AT2 2
AT2_AT1 2

All 113 Differentially Expressed Genes by Adjusted P-value

#output table for report
#select top 50 by pvalue
#then round to 2 decimal points for output table
res_df_outp <- head(res_df[order(res_df$padj, decreasing=FALSE),], 50)
res_df_outp[,c("baseMean", "log2FoldChange", "lfcSE", "stat")] <- round(res_df_outp[,c("baseMean", "log2FoldChange", "lfcSE", "stat")], 2)   
res_df_outp[,c("pvalue")] <- formatC(res_df_outp[,c("pvalue")], format = "e", digits = 2) # do this in 2 steps, else get "Error in is.finite(x) : default method not implemented for type 'list'"
res_df_outp[,c("padj")] <- formatC(res_df_outp[,c("padj")], format = "e", digits = 2) 
DT::datatable(res_df_outp, rownames=FALSE, options = list(columnDefs = list(list(className = 'dt-center', targets = "_all"))))

Description of DESEq2 output

Volcano plots

Volcano plot (probes with a q-value <0.05 are present in red)

# The volplot_func function generates volcano plots
volplot_func <- function(df,qval_column,title) {
  # get qvalue column
  qval <- df[,qval_column]
  if (all(is.na(qval))) {message("The batch and status are highly confounded. Batch effect is not adjusted.")} else {         
    df <- df[!is.na(qval),] # remove NA values
    qval <- df[,qval_column]
    if (min(df[,qval_column])>=0.05) {
      df$sig <- "black" # assign colors to DE and non-DE genes
    } else {
      # assign colors to DE and non-DE genes
      df$sig <- rep(NA,nrow(df))
      df$sig[qval<0.05] <- "red"
      df$sig[qval>=0.05] <- "black"
    }
    df$sig <- as.factor(df$sig)
    color <- levels(df$sig)
    # log10 transformed q values
    df$logqval <- -log10(qval)
    diffgenes <- df$Gene[qval<0.05] #Create list of all DEG's
    signum = paste0(length(diffgenes), " differentially expressed genes")
    #cat(length(diffgenes), "differentially expressed genes have been identified based on", qval_column)
    if (missing(title)) {title=signum}
    print(
      ggplot(df, aes(x = log2FoldChange, y = logqval, color=sig)) + geom_point(size=0.5) +
        theme_bw() +
        labs(title=title,x="log2FoldChange",y=paste0("-log10(",qval_column,")")) +
        scale_color_manual(values=color) +
        theme(legend.position="none")
    )
  }
}
for (qval in c("padj")) {volplot_func(df=res_df, qval_column=qval)}

Heatmaps for top 30 significant genes

Genes were ranked by adjusted p-values.

# The heatmap_topgene_func function for top gene heatmap plots
heatmap_topgene_func <- function(tb, topnum=30, main="") {
  m=log2(dds_count+1)
  top.mat <- m[rownames(m)%in%tb[1:topnum,"Gene"],] # plot heatmap for top genes
  array_name <- colnames(m)
  gene_symbol=tb[1:topnum,"gene_symbol"]
  heatmap.2(na.omit(top.mat), col=viridis(256, option="B"),
            ColSideColors=colour_status_list, # use predefined colour_status_list, assign colors to status
            labCol=array_name, labRow=gene_symbol, # take out gene probe id
            trace="none",
            margins=c(12,20), # (bottom margin, left margin)
            cexRow=1,cexCol=1,
            keysize=1.5,key.title=NA,key.xlab="Log2-normalized counts",key.ylab="Counts",
            main=main)
  legend("bottomleft",legend=names(colour_status),fill=colour_status,cex=0.8) # use predifined colour_status
}
heatmap_topgene_func(tb=res_df, topnum=30, main="")

Boxplots for Relevant Differentially Expressed Genes

Genes were ranked by pvalue. Counts have been normalized by sequencing depth, with pseudocount of 0.5 added to allow for log scale plotting, using DESeq2 function plotCounts(). Out of the 118 differentially expressed genes, 26 relevant genes were selected.

boxplot_func <- function(df) {
  gene_symbol=unique(df$gene_symbol)
  gene_id=unique(df$Gene)
  ggplot(df, aes(x = Status, y = count, fill=Status)) +
    geom_boxplot(outlier.colour=NA, lwd=0.2, color="grey18") + 
    stat_boxplot(geom ='errorbar', color="grey18") + 
    expand_limits(y=0) +
    geom_jitter(size=0.5, width=0.2) + 
    guides(fill=FALSE) +
    theme_bw() +  
    labs(title=paste0(gene_id)) +
    #labs(x="condition") +
    labs(y="Normalized counts") +
    theme(text = element_text(size=9), 
          strip.text.x = element_text(size = 10),
          #axis.text.x = element_text(size=12),
          axis.text.x = element_text(angle = 90, hjust = 1, size=12),
          axis.text.y = element_text(size=9),
          plot.title = element_text(size=12),
          axis.title.x = element_blank(),
          axis.title.y = element_text(size=12))
}
library(readxl)
favorite_gene <- read_excel("/Users/tony/Desktop/BMIN503_Final_Project/Lung_AT2_deseq2_out/favorite gene.xlsx")
favorite_gene <- as.data.frame(favorite_gene)
topnum=26
for (i in 1:topnum) {
  gene_id <- favorite_gene[i,"Gene"]
  gene_symbol <- favorite_gene[i,"gene_symbol"]
  curr_data <- res_ct_df[which(res_ct_df$Gene==gene_id),]
  print(boxplot_func(df=curr_data)+scale_fill_manual(values=colour_status,na.value="grey"))
}

Gene-set enrichment analysis

View top pathways in fgsea results. Only include the first five leading genes

## 57 main pathways are significant.
if (nrow(res_show)==0) {
  res_show <- res[main_pathway=="main"][1:10] # if no significant pathway, take the top 10 main pathways
}
res_show <- res_show %>%
  dplyr::mutate(pval=round(pval,4), padj=round(padj,4), ES=round(ES,3), NES=round(NES,3))
leadingEdge <- rep(NA, nrow(res_show))
leadingEdge <- sapply(res_show$leadingEdge, function(x)paste(x[1:5],collapse=","))
res_show$leadingEdge <- NULL
res_show$leadingEdge <- leadingEdge
DT::datatable(res_show[1:50], rownames=FALSE, options = list(columnDefs = list(list(className = 'dt-center', targets = "_all"))))

Generate barplot for the pathways of interest

dat_barplot <- data.frame(pathway=res_barplot$pathway, NES=res_barplot$NES)
dummies=paste0(rep(LETTERS,each=length(LETTERS)),rep(LETTERS,length(LETTERS)))
dummy <- data.frame(pathway=c(dat_barplot[order(-dat_barplot$NES),"pathway"]),dummy=dummies[1:nrow(dat_barplot)])
dat_barplot <- merge(dat_barplot,dummy,by="pathway")
labels <- as.character(dummy$pathway)
labels <- unname(sapply(labels,textconv_func))
ggplot(dat_barplot,aes(y=NES,x=dummy)) + geom_bar(width=0.8, position=position_dodge(width=0.8), stat="identity", fill="#006d2c") + 
  coord_flip() +
  scale_x_discrete(labels=labels) +
  ylab("normalized enrichment score")+
  theme_bw()+
  theme(
    axis.title.y=element_blank(),
    axis.text.y=element_text(size=11),
    axis.title.x=element_text(size=9))

View leading edges in pathways of interest

top = favorite_path$pathway
  for (i in top[!is.na(top)]) {
    print(plotEnrichment(pathway = pathways.msigkeggreac[[i]], stats = gene_stat) + labs(title=i))
  }

Housekeeping genes

Counts have been normalized by estimated size factors using DESeq2. Obtain the count matrix using function DESeq2::counts.

The table shows p-values of house-keeping genes for each comparison. Generally, house-keeping gene expressions do not change significantly in different conditions.

if (exists("housekeeping_genes")) {
  sel_ids=as.character() # save selected Ensembl ID for house-keeping genes
  for (i in housekeeping_genes) {
    gene_symbol <- i
    gene_ids <- norm.counts[which(norm.counts$gene_symbol==i),'Gene']
    curr_data <- norm.counts[which(norm.counts$gene_symbol==i),names(norm.counts)[names(norm.counts)%in%coldata$Sample]]
    curr_data <- curr_data[which.max(rowSums(curr_data)),]
    gene_id <- gene_ids[which.max(rowSums(curr_data))] # choose Ensembl gene with the most counts
    sel_ids <- c(sel_ids, gene_id)
    curr_data <- data.frame(Sample=colnames(curr_data),Gene=rep(gene_id,ncol(curr_data)),gene_symbol=rep(gene_symbol,ncol(curr_data)),count=as.numeric(curr_data))
    curr_data <- merge(curr_data,coldata[which(coldata$Sample%in%curr_data$Sample),c('Sample','Status')],by='Sample')
    print(boxplot_func(df=curr_data))  }
  dat <- do.call(rbind,lapply(1:length(house_list),function(x){dat=house_list[[x]];dat$Comparison=names(house_list)[x];dat[which(dat$Gene%in%sel_ids),c('gene_symbol','pvalue','Comparison')]})) # obtain all p-value for house-keeping genes
  dat[,c('pvalue')] <- formatC(dat[,c('pvalue')], format = "e", digits = 2)
  dat <- dat %>% spread(gene_symbol, pvalue)
  DT::datatable(dat, rownames=FALSE, options = list(columnDefs = list(list(className = 'dt-center', targets = "_all"))))
}

Conclusion

Quality Control

The quality of these four samples are good. We can see that even though the percentage of mapped versus unmapped reads are a little bit low (65%), all four samples have around 150 million reads, which is enough for the bulk RNA sequencing to work. The sequence duplication level is relatively low with an expected peak at sequence duplication level 10. The percentage of junction spanning reads among mapped reads is high enough (30%). Most of the sequences (75%) are mRNA, which means there is little amount of DNA contamination. All four samples have the same pattern for the Normalized coverage. Although the reads are bit screwed to the 5’ end, my PI informed me that this is not uncommon. The median insert sizes are approximately 400 for each sample. In the end, I think it is safe to say that I can include all my samples in the differential gene analysis part.

At the end, we also showed that there is no difference in housekeeping genes between AT2 and AT1_AT2, which is expected.

Differential Gene Expression Analysis

Out of the 118 statistically significant difference in gene expression between AT2 and AT1_AT2, I found 26 genes of interest. Below I will discuss several important genes in details.

Sftpc, Abca3, Lamp3

Sftpc gene encodes for surfactant protein C. Abca3 gene encodes for ATP-binding cassette sub-family A member 3. Lamp3 gene encodes for Lysosomal Associated Membrane Protein 3. These three proteins are highly expressed in AT2 cells; therefore, we use them as cell markers for AT2 cells. As we can see in the boxplot, we can see that for all three gene, the expression level is higher in AT2 cells. However, there is no significant difference between all three genes’ expression level between AT2 and AT1_AT2 because both groups are currently expressing AT2 characteristics.

Hopx, Aqp5, pdpn

Hopx gene encodes for homeodomain-only protein homeobox. Aqp5 gene encodes for Aquaporin-5. Pdpn gene encodes for Podoplanin. These three proteins are highly expressed in AT1 cells; therefore, we use them as cell markers for AT1 cells. As we can see in the boxplot, we can see that for all three gene, the expression level is higher in AT1_AT2 cells. However, there is no significant difference between Hopx and Aqp5 genes’ expression level between AT2 and AT1_AT2 because AT1_AT2 already lost their previous characteristics of AT1 cells. What is interesting is that for the AT1 cell marker Pdpn, there is a significant difference between the two groups. This might suggest that Pdpn differs from Hopx and Aqp5 as cell markers of AT1 cells. For example, high pdpn expression might indicate a specific stage of AT1 cell.

Itgb6

Itgb6 encodes for Integrin Subunit Beta 6 protein. Morris et al. has shown that this protein binds and activates latent transforming growth factor-beta (TGF-beta). By knocking out the Itgb6 gene in mice, they discovered that the absence of Itgb6 causes Loss of integrin alpha(v)beta6-mediated TGF-beta activation causes age-related emphysema. Moreover, they demonstrated that the effects of Itgb6 deletion are overcome by simultaneous transgenic expression of active TGF-beta1 (Morris et al., 2003). Our result showed that AT1_AT2 has a higher expression of Itgb2 compared to AT2, which might suggest that the high level of Itgb2 promotes AT1_AT2 to regenerate after injury.

Ctgf

Ctgf encodes for connective tissue growth factor (CTGF). Researchers have shown that CTGF is a mediator of growth factor activity. Kablar et al. showed that lungs from homozygous knockout of Ctgf mice were hypoplastic, with reduced cell proliferation and increased apoptosis. Their finding suggests that the absence of CTGF may trigger pulmonary hypoplasia by both disrupting basic lung developmental processes and restricting thoracic expansion (Kablar et al., 2008). ). Our result showed that AT1_AT2 has a higher expression of Ctgf compared to AT2, which might suggest that the increased expression of Ctgf results in higher cell proliferation and decrease in apoptosis for AT1_AT2 cells.

Hbegf:

Hbegf gene encodes for Heparin-binding epidermal growth factor. It is an activating ligand for EGF receptor. Most homozygous knockouts of Hbegf gene result in prenatal death. Hbegf(-/-) newborns have poorly differentiated lungs (Sandilands et al., 2013). Our result showed that AT1_AT2 has a higher expression of Hbegf compared to AT2, which might suggest that the high level of Hbegf endows the AT1_AT2 cells with a great ability to differentiate, and that is why they are able to switch cell fates.

Tgfb2:

Tgfb2 encodes for Transforming growth factor beta. Proper lung development replies on the precise temporal and spatial expression of several morphogenic factors, including Fgf10, Fgf9, Shh, Bmp4, and Tgf-beta. Upregulation and downregulation of the tgfb2 gene often leads to abnormal embryonic or postnatal lung development. Homozygous knockout embryos of tgfb2 developed severe lung hypoplasia and died at birth with enormously reduced lung size (Noe et al., 2019). Our result showed that AT1_AT2 has a higher expression of Tgfb2 compared to AT2, which might suggest that the higher expression of Tgfb2 promotes lung development.

Gene-set enrichment analysis

Out of all 57 significant pathways, I picked 10 outstanding ones and showed the bar plot. These 10 pathways can be categorized into 3 main groups. The first one is cell cycle. All these pathways are upregulated as shown in the leading genes. This makes sense because we know that for AT1_AT2, because they can change their cell fate, they must have greater regenerative potential. In order to proliferate and repopulate the damaged cells, they have to have higher turnover rate, which means faster cell cycles, higher DNA replication rate, and more programmed cell death. The second group is cell interaction with the ECM. All these related pathways are also upregulated. I think a potential reason for this upregulation is because the interactions between AT1 cells and ECM gave the AT1 cells the ability to change cell fate. In order to receive the signal to regenerate and proliferate, there needs to be more interactions between cell and ECM.

The last group is WNT signaling pathway. My current project is on the canonical pathway of WNT signaling. This pathway is characterized by two on and off states. The off state is characterized by the absence of WNT ligands, which leads to the phosphorylation of the β-catenin by the destruction complex composed of axis inhibition protein (AXIN), adenomatous polyposis coli (APC), glycogen synthase kinase 3 beta (GSK3β) and casein kinase (CK1α). The β-catenin is ubiquitinated by beta-transducin repeat containing E3 ubiquitin protein ligase (β-TrCP) and signaled for proteasomal degradation. If β-catenin is not present in the nucleus, a repressive complex containing T-cell factor (TCF) and lymphoid enhancer factor (LEF) will recruit histone deacetylase (HDAC) to shut down expression of target genes. When WNT proteins are present and bind to the Frizzled receptors (FZD) and low-density lipoprotein receptor-related protein 5/6 (LRP 5/6) co-receptors on the cell surface, WNT-signaling switches to the on state. The LRP5/6 receptors become phosphorylated and inactive the destruction complex. As a result, β-catenin will not become degraded and get translocated into the nucleus. β-catenin will then form active complex with TCF, LEF, and other components and in turn activates the downstream gene translations (Nusse et al., 2017; Zhan et al., 2017).

Our results showed that in AT1_AT2 there is a decreased regulation of FZD by ubiquitination and a decreased deactivation of the beta-catenin transactivating complex. If ubiquitination of FZD happens, then the FZD receptors will be degraded by the proteosome, the WNT ligands cannot bind to the FZD receptors, and WNT signaling will decrease. In cell nucleus, beta-catenin binds to the beta-catenin transactivating complex to activate downstream signaling. A decrease in deregulation of beta-catenin transactivating complex means activation of WNT signaling pathways. Both show that an upregulation of WNT signaling pathway will show an increase in cell regeneration and proliferation ability.

Reference

Baguma-Nibasheka, M., & Kablar, B. (2008). Pulmonary hypoplasia in the connective tissue growth factor (Ctgf) null mouse. Developmental Dynamics, 237(2), 485–493. https://doi.org/10.1002/dvdy.21433

Barkauskas, C. E., Cronce, M. J., Rackley, C. R., Bowie, E. J., Keene, D. R., Stripp, B. R., Randell, S.H., Noble, P. W., Hogan, B. L. et al. (2013). "Type 2 alveolar cells are stem cells in adult lung." J Clin Invest 123(7): 3025-3036.

Davidson, L., and Berkelhamer, S. (2017). Bronchopulmonary dysplasia: Chronic lung disease of infancy and long-term pulmonary outcomes. Journal of Clinical Medicine 6, 4.

Morris, D. G., Huang, X., Kaminski, N., Wang, Y., Shapiro, S. D., Dolganov, G., Glick, A., & Sheppard, D. (2003). Loss of integrin αvβ6-mediated TGF-β activation causes mmp12-dependent emphysema. Nature, 422(6928), 169–173. https://doi.org/10.1038/nature01413

Mucenski, M.L., Wert, S.E., Nation, J.M., Loudy, D.E., Huelsken, J., Birchmeier, W., Morrisey, E.E., and Whitsett, J.A. (2003). β-Catenin is required for specification of Proximal/Distal Cell Fate DURING Lung morphogenesis.Journal of Biological Chemistry278, 40231–40238. Nusse, R., and Clevers, H. (2017). Wnt/β-catenin signaling, disease, and emerging therapeutic modalities. Cell 169, 985–999.

Noe, N., Shim, A., Millette, K., Luo, Y., Azhar, M., Shi, W., Warburton, D., & Turcatel, G. (2019). Mesenchyme-specific deletion of TGF-β1 in the embryonic lung disrupts branching morphogenesis and induces lung hypoplasia. Laboratory Investigation, 99(9), 1363–1375. https://doi.org/10.1038/s41374-019-0256-3

Penkala IJ, Liberti DC, Pankin J, et al. Age-dependent alveolar epithelial plasticity orchestrates lung homeostasis and regeneration. Cell Stem Cell. 2021;28(10):1775-1789.e5. doi:10.1016/j.stem.2021.04.026

Sandilands, A., Smith, F. J., Lunny, D. P., Campbell, L. E., Davidson, K. M., MacCallum, S. F., Corden, L. D., Christie, L., Fleming, S., Lane, E. B., & McLean, W. H. (2013). Generation and characterisation of keratin 7 (K7) knockout mice. PLoS ONE, 8(5). https://doi.org/10.1371/journal.pone.0064404

Thébaud, B., Goss, K.N., Laughon, M., Whitsett, J.A., Abman, S.H., Steinhorn, R.H., Aschner, J.L., Davis, P.G., McGrath-Morrow, S.A., Soll, R.F., et al. (2019). Bronchopulmonary dysplasia. Nature Reviews Disease Primers 5.

Zhan, T., Rindtorff, N., and Boutros, M. (2016). Wnt signaling in cancer. Oncogene 36, 1461–1473.

R version 4.0.3 (2020-10-10)

**Platform:** x86_64-apple-darwin17.0 (64-bit)

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: readxl(v.1.3.1), knitr(v.1.36), pander(v.0.6.4), fgsea(v.1.16.0), biomaRt(v.2.46.3), DESeq2(v.1.30.1), SummarizedExperiment(v.1.20.0), Biobase(v.2.50.0), MatrixGenerics(v.1.2.1), matrixStats(v.0.61.0), GenomicRanges(v.1.42.0), GenomeInfoDb(v.1.26.7), IRanges(v.2.24.1), S4Vectors(v.0.28.1), BiocGenerics(v.0.36.1), viridis(v.0.6.2), viridisLite(v.0.4.0), gplots(v.3.1.1), ggplot2(v.3.3.5), DT(v.0.20), RColorBrewer(v.1.1-2), tidyr(v.1.1.4) and readr(v.2.1.0)

loaded via a namespace (and not attached): colorspace(v.2.0-2), ellipsis(v.0.3.2), XVector(v.0.30.0), rstudioapi(v.0.13), farver(v.2.1.0), bit64(v.4.0.5), AnnotationDbi(v.1.52.0), fansi(v.0.5.0), xml2(v.1.3.2), splines(v.4.0.3), cachem(v.1.0.6), geneplotter(v.1.68.0), jsonlite(v.1.7.2), annotate(v.1.68.0), dbplyr(v.2.1.1), compiler(v.4.0.3), httr(v.1.4.2), assertthat(v.0.2.1), Matrix(v.1.3-4), fastmap(v.1.1.0), cli(v.3.1.0), htmltools(v.0.5.2), prettyunits(v.1.1.1), tools(v.4.0.3), gtable(v.0.3.0), glue(v.1.5.0), GenomeInfoDbData(v.1.2.4), dplyr(v.1.0.7), rappdirs(v.0.3.3), fastmatch(v.1.1-3), Rcpp(v.1.0.7), cellranger(v.1.1.0), jquerylib(v.0.1.4), vctrs(v.0.3.8), crosstalk(v.1.2.0), xfun(v.0.28), stringr(v.1.4.0), lifecycle(v.1.0.1), gtools(v.3.9.2), XML(v.3.99-0.8), zlibbioc(v.1.36.0), scales(v.1.1.1), vroom(v.1.5.6), hms(v.1.1.1), yaml(v.2.2.1), curl(v.4.3.2), memoise(v.2.0.0), gridExtra(v.2.3), stringi(v.1.7.5), RSQLite(v.2.2.8), highr(v.0.9), genefilter(v.1.72.1), caTools(v.1.18.2), BiocParallel(v.1.24.1), rlang(v.0.4.12), pkgconfig(v.2.0.3), bitops(v.1.0-7), evaluate(v.0.14), lattice(v.0.20-45), purrr(v.0.3.4), htmlwidgets(v.1.5.4), labeling(v.0.4.2), bit(v.4.0.4), tidyselect(v.1.1.1), magrittr(v.2.0.1), R6(v.2.5.1), generics(v.0.1.1), DelayedArray(v.0.16.3), DBI(v.1.1.1), pillar(v.1.6.4), withr(v.2.4.2), survival(v.3.2-13), RCurl(v.1.98-1.5), tibble(v.3.1.6), crayon(v.1.4.2), KernSmooth(v.2.23-20), utf8(v.1.2.2), BiocFileCache(v.1.14.0), tzdb(v.0.2.0), rmarkdown(v.2.11), progress(v.1.2.2), locfit(v.1.5-9.4), grid(v.4.0.3), data.table(v.1.14.2), blob(v.1.2.2), digest(v.0.6.28), xtable(v.1.8-4), openssl(v.1.4.5), munsell(v.0.5.0) and askpass(v.1.1)